Movie Rating Model and Predictor

Part 5: Modeling

The aim at this stage was to develop two prediction models. Model One, a simple linear regression model, identified which of the quantitative variables was the best predictor of the log of daily box office revenue. That predictor was then designated the response variable for Model Two: a multiregression model that selected the best predictors for the designated response variable. Model Two was the best performing of four multiregression models, developed using both forward selection and backward elimination. These four models and their model selection methods were:
Table 1: Multiregression prediction models
Model   Model.Selection        Data
Alpha   Forward Selection      Full model
Beta    Forward Selection      Full model, influential outliers removed
Gamma   Backward Elimination   Full model
Delta   Backward Elimination   Full model, influential outliers removed

The remainder of this section is organized as follows.

  1. Model One: Simple Linear Regression Model
    1.1. Model Selection
    1.2. Model Diagnostics
    1.3. Model Interpretation

  2. Model Two: Multiregression Model
    2.1. Model Selection Methods
    2.2. Full Model
    2.3. Model Alpha
    2.4. Model Beta
    2.5. Model Gamma
    2.6. Model Delta
    2.7. Model Comparison
    2.8. Model Two: Final Multiregression Model

  3. Model Summary

Model One: Simple Linear Regression

Model Selection

Several simple linear models were fit to determine which of the quantitative variables in Table 3 was the best predictor of the log of daily box office revenue.

Table 3: Simple linear regression variables
Variable Description
audience_score Audience score on Rotten Tomatoes
cast_experience The sum across all cast members for a film, of the number of films in which each actor appeared
cast_experience_log Log of the sum across all cast members for a film, of the number of films in which each actor appeared
cast_scores Total number of allocated audience and IMDB scores per day for the cast of a film
cast_scores_log Log of cast_scores
cast_votes Total number of allocated IMDB votes per day for the cast of a film
cast_votes_log Log of cast_votes
critics_score Critics score on Rotten Tomatoes
director_experience Total number of films in sample for a director
director_experience_log Log of the total number of films directed by the film’s director
imdb_num_votes Number of votes on IMDB
imdb_num_votes_log Log number of IMDB votes
imdb_rating Rating on IMDB
runtime Runtime of movie (in minutes)
runtime_log Log runtime of movie (in minutes)
votes_per_day The number of IMDB Votes / thtr_days
votes_per_day_log Log of votes_per_day

As suggested by the correlation analysis in Table 4 and summarized in Table 5, the log number of IMDB votes was the best predictor of the log of daily box office revenue (F(1, 177) = 332.237, p < .001), with an adjusted R-squared of 0.65. The model accounted for 65% of the variance in the response.

Table 5: Best performing simple linear regression on log of box office revenue
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
imdb_num_votes_log 1 1355.05 1355.05 332.24 0 65.24
Residuals 177 721.90 4.08 NA NA 34.76
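
The summary statistics quoted for Model One follow directly from the sums of squares in the analysis of variance. The sketch below recomputes them, using the three-decimal sums of squares reported for this model in the Model Interpretation section; the function name `anova_summary` is illustrative and is not taken from the original analysis.

```python
# Sums of squares and degrees of freedom for Model One
# (imdb_num_votes_log as the single predictor).
ss_model = 1355.045
ss_resid = 721.903
df_model = 1
df_resid = 177


def anova_summary(ss_model, ss_resid, df_model, df_resid):
    """Return (R-squared, adjusted R-squared, F statistic) from ANOVA terms."""
    ss_total = ss_model + ss_resid
    r2 = ss_model / ss_total
    n = df_model + df_resid + 1  # observations, counting the intercept
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
    f_stat = (ss_model / df_model) / (ss_resid / df_resid)
    return r2, adj_r2, f_stat


r2, adj_r2, f_stat = anova_summary(ss_model, ss_resid, df_model, df_resid)
print(f"R-squared = {r2:.4f}, adjusted R-squared = {adj_r2:.4f}, F = {f_stat:.3f}")
```

The percent of variance column is the R-squared expressed as a percentage, and the F statistic is the ratio of the model mean square to the residual mean square.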

Model Diagnostics

Linearity

The linearity of the predictor with the log of daily box office is illustrated in Figure 1.

Figure 1 Model One linearity plot

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(2), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 2) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 2 Model One homoscedasticity plot

The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.198). As such, the homoscedasticity assumption was met in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 3 illustrate the distribution of residuals.

Figure 3 Model One residuals plot

The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.996, p = 0.885) and the skewness (-0.016) and kurtosis (3.147) supported the assumption of normality: the test did not reject normality, and the skewness and kurtosis were close to the values of 0 and 3 expected of a normal distribution.
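
For reference, the skewness and kurtosis quoted above can be computed from the residuals with the moment formulas below. This is a minimal sketch that assumes the report uses the non-excess (Pearson) kurtosis, under which a normal distribution scores 3; the sample values are made up for illustration and are not residuals from the model.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Non-excess kurtosis; 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Illustrative, symmetric sample (not the model residuals).
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(skewness(sample), kurtosis(sample))
```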

Outliers

Figure 4 Model One Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 10 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points were not removed from the model.
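
Cook's distance combines each observation's residual with its leverage. For a simple linear regression both have closed forms, so the diagnostic can be sketched without a statistics library. The data and the 4/n cutoff below are assumptions for illustration; the report does not state which cutoff was used to flag the 10 cases.

```python
def cooks_distances(x, y):
    """Cook's distance for each observation in a simple regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # estimated coefficients: intercept and slope
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        distances.append(e * e / (p * s2) * leverage / (1 - leverage) ** 2)
    return distances


# Made-up data: ten points on a line plus one high-leverage outlier.
x = [float(v) for v in range(1, 11)] + [20.0]
y = [2 * v + 1 for v in x[:-1]] + [10.0]

d = cooks_distances(x, y)
cutoff = 4 / len(x)  # assumed rule of thumb, not taken from the report
flagged = [i for i, di in enumerate(d) if di > cutoff]
print(flagged)
```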

Model Interpretation

The final prediction equation was defined as follows:
\(y_i = -7.29 + 1.24x_1 + \epsilon\)

where:
\(x_1\) is imdb_num_votes_log

Analysis of Variance

Figure 5 summarizes the analysis of variance.
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
imdb_num_votes_log 1 1355.045 1355.045 332.237 0 65.24
Residuals 177 721.903 4.079 NA NA 34.76

Figure 5 Model One analysis of variance

An analysis of variance was conducted on the influence of 1 independent variable on the log daily box office. The effect of imdb_num_votes_log on the log daily box office yielded an F statistic of F(1, 177) = 332.237, p < .001, accounting for 65.24% of the variance. Residuals accounted for the remaining 34.76% of the variance. The model was significant (F(1, 177) = 332.237, p < .001), with an adjusted R-squared of 0.65.

Interpretation of Coefficients

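The intercept, -7.29, is the predicted log daily box office revenue for a film whose log number of IMDB votes is zero. The predicted log daily box office revenue (in log dollars) is therefore -7.29 plus 1.24 for each unit increase in the log number of IMDB votes. Because both variables are logged, a 1% increase in the number of IMDB votes is associated with an increase of approximately 1.24% in daily box office revenue.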
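
As a worked illustration of the prediction equation, the sketch below applies the two fitted coefficients to a vote count. It assumes the logs are natural logs (the report does not state the base), and the function names are hypothetical.

```python
import math

INTERCEPT = -7.29  # fitted intercept of Model One
SLOPE = 1.24       # fitted coefficient on imdb_num_votes_log


def predict_log_revenue(imdb_num_votes):
    """Predicted log daily box office revenue for a given number of IMDB votes."""
    return INTERCEPT + SLOPE * math.log(imdb_num_votes)


def predict_revenue(imdb_num_votes):
    """Back-transform the log prediction to the revenue scale."""
    return math.exp(predict_log_revenue(imdb_num_votes))


# Doubling the vote count multiplies predicted revenue by 2 ** 1.24.
ratio = predict_revenue(2000) / predict_revenue(1000)
print(round(ratio, 3))
```

The ratio does not depend on the starting vote count, which is the sense in which the slope of a log-log model acts as an elasticity.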

Model Two: Multiple Linear Regression

Model Two was the best performing of models Alpha, Beta, Gamma, and Delta. The following provides an overview of the model selection methods used, then each model is described and diagnosed vis-a-vis assumptions of linearity, homoscedasticity, normality of errors, multicollinearity, and the treatment of influential points.

Model Selection Methods

Both forward selection and backward elimination were used as model selection techniques. The forward selection approach optimized adjusted R-squared, whereas the backward elimination method was based upon p-values.

Forward Selection

The forward selection process began with a null model. Each candidate variable was then added to the model, one at a time, and the variable that provided the greatest improvement over the current best adjusted R-squared was selected. The process repeated with the variables not already in the model until all variables had been analyzed. A variable was retained at each step only if it improved adjusted R-squared.
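
The forward selection procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the report: it handles numeric predictors only, fits ordinary least squares through the normal equations, and runs on a small made-up data set with the hypothetical names `x1`, `x2`, and `x3`.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def adjusted_r2(predictors, y):
    """Adjusted R-squared of an OLS fit of y on the given predictor columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(design[0])  # number of coefficients, including the intercept
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(beta, r)) for r in design]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (sse / (n - k)) / (sst / (n - 1))


def forward_select(data, y):
    """Add, at each step, the variable that most improves adjusted R-squared."""
    selected, best = [], 0.0  # the null model has an adjusted R-squared of 0
    remaining = set(data)
    while remaining:
        scores = {name: adjusted_r2([data[s] for s in selected + [name]], y)
                  for name in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            break  # no remaining variable improves the model
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best


# Made-up data: y depends strongly on x1, weakly on x2, and not on x3.
x1 = [float(i) for i in range(1, 13)]
x2 = [1.0, 0.0] * 6
x3 = [5.0, 3.0, 1.0, 8.0, 2.0, 9.0, 7.0, 4.0, 0.0, 6.0, 5.0, 3.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1]
y = [3 * a + 2 * b + e for a, b, e in zip(x1, x2, noise)]

selected, best = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
print(selected, round(best, 4))
```

Categorical predictors such as genre would enter as a block of indicator columns added or rejected together, which this sketch does not cover.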

Backward Elimination

The backward elimination approach began with the full model. A regression analysis was performed and the least significant predictor (that with the highest p-value) was removed from the model. This process repeated, removing only the least significant predictor at each step, until all predictors had p-values below the preset threshold.
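
The elimination loop itself is independent of how the p-values are obtained, so it can be sketched with the model fit passed in as a function. Everything here is an assumption for illustration: the 0.05 threshold (the report does not state its value), the callback name `p_values_for`, and the stub that returns fixed p-values instead of refitting a regression.

```python
def backward_eliminate(predictors, p_values_for, threshold=0.05):
    """Drop the least significant predictor until all p-values are below threshold.

    p_values_for takes a list of predictor names and returns a dict mapping
    each name to its p-value in a model fit on exactly those predictors.
    """
    current = list(predictors)
    removed = []
    while current:
        p_values = p_values_for(current)
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] < threshold:
            break  # every remaining predictor is significant
        current.remove(worst)
        removed.append((worst, p_values[worst]))
    return current, removed


# Stub for illustration only: fixed, made-up p-values. A real implementation
# would refit the regression on `names` at every call.
FIXED_P = {"a": 0.90, "b": 0.60, "c": 0.30, "d": 0.01, "e": 0.001}


def stub_p_values(names):
    return {name: FIXED_P[name] for name in names}


kept, removed = backward_eliminate(list(FIXED_P), stub_p_values)
print(kept, removed)
```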

Full Model Selection

Since the objective of the analysis was to determine what factors make a movie popular, the full model did not include variables that could be considered proxies of popularity, such as audience rating or IMDB rating. Such ratings are measures of a film’s popularity, not predictors. Critics rating, on the other hand, was considered not a measure but a potential leading indicator of movie popularity. Similarly, effort was made to capture the popularity of specific cast members to test the hypothesis that a cast’s aggregate popularity could influence the popularity of a film. The criteria for excluding a variable from the full model were as follows:
* Measures of film popularity such as the audience rating, IMDB rating, and top 200 box office variables
* Categorical variables with levels containing fewer than 5 observations, such as title, url, studio, and the actor variables
* The year and day of theatrical or DVD release

The resulting full model is presented in Table 6.
Type Variable Description
Categorical best_actor_win Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
Categorical best_actress_win Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for their role in the given movie
Categorical best_dir_win Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
Categorical best_pic_nom Whether or not the movie was nominated for a best picture Oscar (no, yes)
Categorical best_pic_win Whether or not the movie won a best picture Oscar (no, yes)
Categorical genre Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
Categorical mpaa_rating MPAA rating of the movie (G, PG, PG-13, R, Unrated)
Categorical thtr_rel_month Month the movie is released in theaters
Numeric cast_scores Total number of allocated audience and IMDB scores per day for the cast of a film
Numeric cast_votes_log Log of cast_votes
Numeric critics_score Critics score on Rotten Tomatoes
Numeric director_experience_log Log of the total number of films directed by the film’s director
Numeric runtime_log Log runtime of movie (in minutes)
Numeric votes_per_day_log Log of votes_per_day

The following sections explore various models, model selection techniques, and model diagnostics. Comparisons are conducted and the models are evaluated on test data for prediction accuracy and stability. Lastly, the best performing model is selected and described in detail.

Model Alpha

For this model, a forward selection procedure was undertaken based upon the full model described above. Table 7 lists the variables in the order in which they were added.

Table 7: Model Alpha forward selection process
Step Selected Model.Size DF DF.Residual F.statistic R.Squared Adjusted.R2 p.value Pct Chg
1 cast_scores 1 2 481 110.31 0.19 0.18 0 0.00
2 genre 2 12 471 20.52 0.32 0.31 0 66.49
3 critics_score 3 13 470 21.46 0.35 0.34 0 9.74
4 mpaa_rating 4 17 466 17.73 0.38 0.36 0 5.62
5 cast_votes 5 18 465 17.38 0.39 0.37 0 2.52
6 best_pic_nom 6 19 464 16.98 0.40 0.37 0 2.19
7 runtime_log 7 20 463 16.42 0.40 0.38 0 1.07
8 best_dir_win 8 21 462 15.76 0.41 0.38 0 0.53

As indicated in Table 8 and graphically depicted in Figure 6, the model was significant (F(21, 462) = 15.759, p < .001), with an adjusted R-squared of 0.38.

Table 8: Model Alpha Summary Statistics
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Alpha 8 21 462 15.759 1.85 1.85 0.406 0.38 0 40.554

Figure 6 Model Alpha Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 7.

Figure 7 Model Alpha linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(21), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 8) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 8 Model Alpha homoscedasticity plot

The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.253). As such, the homoscedasticity assumption was met in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 9 illustrate the distribution of residuals.

Figure 9 Model Alpha residuals plot

The histogram and normal Q-Q plot did not suggest a normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.992, p = 0.008) and the skewness (0.147) and kurtosis (2.444) indicated that normality of residuals was not a reasonable assumption for this model.

Multicollinearity

As shown in Figure 10 and Table 9, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 3 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 10: Model Alpha correlations among quantitative predictors

Table 9 Model Alpha variance inflation Factors
GVIF Df GVIF^(1/(2*Df))
cast_scores 2.545 1 1.595
genre 3.024 10 1.057
critics_score 1.427 1 1.195
mpaa_rating 2.516 4 1.122
cast_votes 2.281 1 1.510
best_pic_nom 1.136 1 1.066
runtime_log 1.429 1 1.195
best_dir_win 1.100 1 1.049
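
The third column of Table 9 is derived from the first two: for a term with Df degrees of freedom, the generalized VIF is raised to the power 1/(2*Df) so that multi-level factors such as genre can be compared with single-coefficient terms. The sketch below recomputes that column from the table; the helper name `adjusted_gvif` is illustrative.

```python
# (GVIF, Df) pairs copied from Table 9.
TABLE_9 = {
    "cast_scores": (2.545, 1),
    "genre": (3.024, 10),
    "critics_score": (1.427, 1),
    "mpaa_rating": (2.516, 4),
    "cast_votes": (2.281, 1),
    "best_pic_nom": (1.136, 1),
    "runtime_log": (1.429, 1),
    "best_dir_win": (1.100, 1),
}


def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)): the GVIF scaled for the term's degrees of freedom."""
    return gvif ** (1 / (2 * df))


adjusted = {term: adjusted_gvif(g, df) for term, (g, df) in TABLE_9.items()}
largest_term = max(TABLE_9, key=lambda term: TABLE_9[term][0])

for term, value in adjusted.items():
    print(f"{term:15s} {value:.3f}")
print("largest raw GVIF:", largest_term)
```

For a term with one degree of freedom the adjusted value is the square root of the ordinary VIF, so squaring the adjusted value puts every term on a scale comparable with a VIF threshold.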

Outliers

Figure 11 Model Alpha Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 23 cases exerting undue influence on the model. To discern the effect of these outliers on the model, a new model (Model Beta) was created with the outliers removed.

Model Beta

This was also a forward selection model; however, it was based upon the full model with the outliers from Model Alpha removed. The variables were added as described in Table 10.

Table 10: Model Beta forward selection process
Step Selected Model.Size DF DF.Residual F.statistic R.Squared Adjusted.R2 p.value Pct Chg
1 cast_scores 1 2 458 116.97 0.20 0.20 0 0.00
2 genre 2 12 448 21.80 0.35 0.33 0 64.85
3 critics_score 3 13 447 22.64 0.38 0.36 0 8.41
4 cast_votes 4 14 446 22.75 0.40 0.38 0 5.54
5 mpaa_rating 5 18 442 18.65 0.42 0.40 0 3.67
6 best_pic_nom 6 19 441 18.31 0.43 0.40 0 2.28
7 runtime_log 7 20 440 17.74 0.43 0.41 0 1.24
8 best_pic_win 8 21 439 17.01 0.44 0.41 0 0.49

As indicated in Table 11 and graphically depicted in Figure 12, the model was significant (F(21, 439) = 17.013, p < .001), with an adjusted R-squared of 0.411.

Table 11: Model Beta Summary Statistics
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Beta 8 21 439 17.013 1.759 1.759 0.437 0.411 0 43.664

Figure 12 Model Beta Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 13.

Figure 13 Model Beta linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(21), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 14) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 14 Model Beta homoscedasticity plot

The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.658). As such, the homoscedasticity assumption was met in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 15 illustrate the distribution of residuals.

Figure 15 Model Beta residuals plot

The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.995, p = 0.131) and the skewness (0.127) and kurtosis (2.604) supported the assumption of normality.

Multicollinearity

As shown in Figure 16 and Table 12, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 3.3 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 16: Correlations among quantitative predictors

Table 12 Model Beta Variance Inflation Factors
GVIF Df GVIF^(1/(2*Df))
cast_scores 2.569 1 1.603
genre 3.308 10 1.062
critics_score 1.412 1 1.188
cast_votes 2.306 1 1.518
mpaa_rating 2.448 4 1.118
best_pic_nom 1.320 1 1.149
runtime_log 1.519 1 1.233
best_pic_win 1.230 1 1.109

Outliers

Figure 17 Model Beta Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 24 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.

Model Gamma

For this model, a backward elimination procedure was undertaken based upon the full model. The variables were removed as described in Table 13.

Table 13: Model Gamma
Steps Removed p.value
1 director_experience 0.93
2 cast_scores_log 0.93
3 director_experience_log 0.69
4 cast_votes_log 0.66
5 best_actress_win 0.34

The model therefore retained the following variables:

Table 14 Model Gamma Variables
Variable Description
genre Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime_log Log runtime of movie (in minutes)
mpaa_rating MPAA rating of the movie (G, PG, PG-13, R, Unrated)
thtr_rel_month Month the movie is released in theaters
critics_score Critics score on Rotten Tomatoes
best_pic_nom Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_dir_win Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
cast_votes Total number of allocated IMDB votes per day for the cast of a film
cast_scores Total number of allocated audience and IMDB scores per day for the cast of a film

As indicated in Table 15 and graphically depicted in Figure 18, the model was significant (F(34, 449) = 9.866, p < .001), with an adjusted R-squared of 0.378.

Table 15 Model Gamma Summary Statistics
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Gamma 11 34 449 9.866 1.853 1.853 0.42 0.378 0 42.032

Figure 18 Model Gamma Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 19.

Figure 19 Model Gamma linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(34), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 20) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 20 Model Gamma homoscedasticity plot

The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.29). As such, the homoscedasticity assumption was met in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 21 illustrate the distribution of residuals.

Figure 21 Model Gamma residuals plot

The histogram and normal Q-Q plot did not suggest a normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.99, p = 0.003) and the skewness (0.149) and kurtosis (2.397) indicated that normality of residuals was not a reasonable assumption for this model.

Multicollinearity

As shown in Figure 22 and Table 16, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 3.9 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 22: Correlations among quantitative predictors

Table 16 Model Gamma Variance Inflation Factors
GVIF Df GVIF^(1/(2*Df))
genre 3.854 10 1.070
mpaa_rating 2.760 4 1.135
thtr_rel_month 1.752 11 1.026
best_pic_nom 1.445 1 1.202
best_pic_win 1.365 1 1.168
best_actor_win 1.297 1 1.139
best_dir_win 1.212 1 1.101
critics_score 1.457 1 1.207
runtime_log 1.551 1 1.246
cast_scores 2.614 1 1.617
cast_votes 2.429 1 1.559

Outliers

Figure 23 Model Gamma Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 17 cases exerting undue influence on the model. To discern the effect of the influential points on the model, a new model (Model Delta) was created without the influential points of this model.

Model Delta

This was also a backward elimination model; however, it was based upon the full model with the outliers from Model Gamma removed. The variables were removed as described in Table 17.

Table 17: Model Delta
Steps Removed p.value
1 director_experience 0.99
2 cast_scores_log 0.83
3 director_experience_log 0.78
4 cast_votes_log 0.49
5 best_actress_win 0.29

The model therefore retained the following variables:

Table 18 Model Delta Variables
Variable Description
genre Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime_log Log runtime of movie (in minutes)
mpaa_rating MPAA rating of the movie (G, PG, PG-13, R, Unrated)
thtr_rel_month Month the movie is released in theaters
critics_score Critics score on Rotten Tomatoes
best_pic_nom Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_dir_win Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
cast_votes Total number of allocated IMDB votes per day for the cast of a film
cast_scores Total number of allocated audience and IMDB scores per day for the cast of a film

As indicated in Table 19 and graphically depicted in Figure 24, the model was significant (F(34, 432) = 10.953, p < .001), with an adjusted R-squared of 0.414.

Table 19 Model Delta Summary Statistics
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Delta 11 34 432 10.953 1.761 1.761 0.456 0.414 0 45.554

Figure 24 Model Delta Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 25.

Figure 25 Model Delta linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(34), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 26) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 26 Model Delta homoscedasticity plot

The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.499). As such, the homoscedasticity assumption was met in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 27 illustrate the distribution of residuals.

Figure 27 Model Delta residuals plot

The histogram and normal Q-Q plot did not suggest a normal distribution of residuals. A review of the Shapiro-Wilk test (SW = 0.991, p = 0.006) and the skewness (0.203) and kurtosis (2.506) indicated that normality of residuals was not a reasonable assumption for this model.

Multicollinearity

As shown in Figure 28 and Table 20, collinearity appeared extant for this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 4.1 exceeded the threshold of 4. As such, the correlation among the predictors would require further consideration. The multicollinearity assumption was not met for this model.
Figure 28: Correlations among quantitative predictors

Table 20 Model Delta Variance Inflation Factors
GVIF Df GVIF^(1/(2*Df))
genre 4.102 10 1.073
mpaa_rating 2.655 4 1.130
thtr_rel_month 1.848 11 1.028
best_pic_nom 1.458 1 1.207
best_pic_win 1.366 1 1.169
best_actor_win 1.281 1 1.132
best_dir_win 1.226 1 1.107
critics_score 1.463 1 1.210
runtime_log 1.651 1 1.285
cast_scores 2.626 1 1.620
cast_votes 2.482 1 1.575

Outliers

Figure 29 Model Delta Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 18 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.

Model Comparisons

To summarize, models Alpha and Beta were constructed using forward selection and models Gamma and Delta were developed via backward elimination. Models Beta and Delta were fitted without the influential data points from models Alpha and Gamma respectively.

Table 21 Summary of models
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Alpha 8 21 462 15.759 1.850 1.850 0.406 0.380 0 40.554
Model Beta 8 21 439 17.013 1.759 1.759 0.437 0.411 0 43.664
Model Gamma 11 34 449 9.866 1.853 1.853 0.420 0.378 0 42.032
Model Delta 11 34 432 10.953 1.761 1.761 0.456 0.414 0 45.554

Forward Selection vs. Backward Elimination

The differences in root mean square error between the forward selection and backward elimination models were small (0.17% and -0.12%). Similarly, the differences in adjusted R-squared (0.55% and 0.72%) and in the percent of variance explained by the models (3.64% and 4.33%) were not substantial.

Influential Points: Drop or Not

The Beta and Delta models were trained on data without the influential points from Alpha and Gamma, respectively. The differences in RMSE (5.21% and 5.25%) were modest, as were the differences in adjusted R-squared (8.21% and 9.59%) and in the percent of variance explained (7.67% and 8.38%). Moreover, a case-wise review of the influential points did not reveal any data quality issues; therefore, the points would not be removed.
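
The percent-of-variance comparisons in the two subsections above can be reproduced from Table 21 as relative changes. The sketch below is an inference from the reported numbers rather than the original calculation: relative change reproduces the four percent-of-variance figures, while the RMSE and adjusted R-squared figures do not reproduce exactly from the rounded table values and are left out.

```python
# Percent of variance explained, copied from Table 21.
PCT_VARIANCE = {"Alpha": 40.554, "Beta": 43.664, "Gamma": 42.032, "Delta": 45.554}


def pct_change(baseline, other):
    """Relative change from baseline to other, in percent."""
    return (other - baseline) / baseline * 100


# Forward selection versus backward elimination, on the same data.
selection_effect = [
    pct_change(PCT_VARIANCE["Alpha"], PCT_VARIANCE["Gamma"]),
    pct_change(PCT_VARIANCE["Beta"], PCT_VARIANCE["Delta"]),
]

# With versus without the influential points, for the same selection method.
influence_effect = [
    pct_change(PCT_VARIANCE["Alpha"], PCT_VARIANCE["Beta"]),
    pct_change(PCT_VARIANCE["Gamma"], PCT_VARIANCE["Delta"]),
]

print([round(v, 2) for v in selection_effect])
print([round(v, 2) for v in influence_effect])
```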

Prediction Accuracy

To evaluate the effects of model selection method and the treatment of outliers on prediction accuracy, the four multiregression models were evaluated on the test data. Four measures of prediction accuracy were used:

  1. MAPE - Mean Absolute Percentage Error
  2. MPE - Mean Percentage Error
  3. MSE - Mean Squared Error
  4. RMSE - Root Mean Squared Error

In addition, a percent accuracy measure was computed as the percentage of the observations in the test set in which the actual log number of IMDB votes fell within the prediction interval.
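
The four accuracy measures and the percent accuracy measure can be written out as follows. This is a minimal sketch on made-up values, and it assumes errors are defined as actual minus predicted, which is the convention under which a negative MPE corresponds to over-prediction.

```python
import math


def accuracy_measures(actual, predicted):
    """MAPE, MPE, MSE and RMSE, with errors defined as actual minus predicted."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mape = 100 / n * sum(abs(e / a) for e, a in zip(errors, actual))
    mpe = 100 / n * sum(e / a for e, a in zip(errors, actual))
    mse = sum(e * e for e in errors) / n
    return {"MAPE": mape, "MPE": mpe, "MSE": mse, "RMSE": math.sqrt(mse)}


def pct_within_interval(actual, lower, upper):
    """Percent of observations whose actual value lies inside its interval."""
    hits = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return 100 * hits / len(actual)


# Made-up log vote counts and predictions, for illustration only.
actual = [10.0, 8.0, 12.0, 9.0]
predicted = [11.0, 8.0, 13.0, 9.0]
measures = accuracy_measures(actual, predicted)

# Hypothetical prediction intervals of half-width 0.5 around each prediction.
lower = [p - 0.5 for p in predicted]
upper = [p + 0.5 for p in predicted]
coverage = pct_within_interval(actual, lower, upper)

print(measures, coverage)
```

MAPE and MPE share the same terms and differ only in the absolute value, so an MPE close to the negative of MAPE indicates that almost all errors are over-predictions.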

Table 22 Model Predictive Accuracy Summary
Model Size F Statistic R-Squared Adj R-Squared % Variance MAPE MPE MSE RMSE % Accuracy
Model Alpha 8 15.759 0.406 0.380 40.554 11.489 -2.906 3.524 1.877 95.902
Model Beta 8 17.013 0.437 0.411 43.664 11.466 -2.979 3.519 1.876 94.262
Model Gamma 11 9.866 0.420 0.378 42.032 11.365 -3.315 3.461 1.860 95.902
Model Delta 11 10.953 0.456 0.414 45.554 11.185 -3.278 3.419 1.849 96.721

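There were no substantial differences in MAPE, MSE, and RMSE between the models. The negative MPE indicated that all models were biased toward over-prediction. From a percent accuracy perspective, the forward selection and backward elimination models performed nearly identically with and without the influential points. Retaining the influential points raised percent accuracy for the forward selection models (95.902% for Alpha versus 94.262% for Beta) and lowered it slightly for the backward elimination models (95.902% for Gamma versus 96.721% for Delta). Model Delta had the highest percent accuracy, but it was fit without the influential points, which were to be retained. Of the two models fit with the influential points, Alpha and Gamma had identical percent accuracy, and Alpha was the smaller model. The Alpha model would advance to the prediction stage.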

Model Two: Final Multiregression Model

The final prediction equation was defined as follows:
\(y_i = 8.356 + 0.001x_1 - 0.743x_2 - 2.541x_3 - 1.329x_4 - 4.065x_5 - 2.028x_6 - 0.681x_7 - 3.629x_8 - 1.454x_9 - 1.818x_{10} - 0.528x_{11} + 0.017x_{12} - 0.108x_{13} + 0.751x_{14} + 0.363x_{15} - 0.847x_{16} + 0.000x_{17} + 1.052x_{18} + 0.765x_{19} + 0.524x_{20} + \epsilon\)

where:
\(x_1\) is cast_scores
\(x_2\) is genreAnimation
\(x_3\) is genreArt House & International
\(x_4\) is genreComedy
\(x_5\) is genreDocumentary
\(x_6\) is genreDrama
\(x_7\) is genreHorror
\(x_8\) is genreMusical & Performing Arts
\(x_9\) is genreMystery & Suspense
\(x_{10}\) is genreOther
\(x_{11}\) is genreScience Fiction & Fantasy
\(x_{12}\) is critics_score
\(x_{13}\) is mpaa_ratingPG
\(x_{14}\) is mpaa_ratingPG-13
\(x_{15}\) is mpaa_ratingR
\(x_{16}\) is mpaa_ratingUnrated
\(x_{17}\) is cast_votes
\(x_{18}\) is best_pic_nomyes
\(x_{19}\) is runtime_log
\(x_{20}\) is best_dir_winyes

The genre and MPAA rating variables, like the best_pic_nom and best_dir_win indicators, were coded 0 or 1 in accordance with the level observed for each film.

Analysis of Variance

Figure 30 summarizes the analysis of variance.
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
cast_scores 1 496.294 496.294 144.988 0.000 18.66
genre 10 365.435 36.544 10.676 0.000 13.74
critics_score 1 80.027 80.027 23.379 0.000 3.01
mpaa_rating 4 64.886 16.222 4.739 0.001 2.44
cast_votes 1 27.065 27.065 7.907 0.005 1.02
best_pic_nom 1 22.831 22.831 6.670 0.010 0.86
runtime_log 1 14.223 14.223 4.155 0.042 0.53
best_dir_win 1 8.099 8.099 2.366 0.125 0.30
Residuals 462 1581.424 3.423 NA NA 59.45

Figure 30 Model Alpha analysis of variance

An analysis of variance was conducted on the influence of 8 independent variables on the log number of IMDB votes. The effect of cast_scores yielded an F statistic of F(1, 462) = 144.988, p < .001, accounting for 18.66% of the variance. The effect of genre yielded F(10, 462) = 10.676, p < .001, accounting for 13.74% of the variance. The effect of critics_score yielded F(1, 462) = 23.379, p < .001, accounting for 3.01% of the variance. The effect of mpaa_rating yielded F(4, 462) = 4.739, p = .001, accounting for 2.44% of the variance. The effect of cast_votes yielded F(1, 462) = 7.907, p < .01, accounting for 1.02% of the variance. The effect of best_pic_nom yielded F(1, 462) = 6.67, p < .05, accounting for 0.86% of the variance. The effect of runtime_log yielded F(1, 462) = 4.155, p < .05, accounting for 0.53% of the variance. The effect of best_dir_win was not significant at the .05 level (F(1, 462) = 2.366, p = 0.125) and accounted for 0.3% of the variance. Finally, residuals represented 59.45% of the variance. The model was significant (F(21, 462) = 15.759, p < .001), with an adjusted R-squared of 0.38.

Interpretation of Coefficients

Although there are only 8 variables, there are 21 coefficients, a consequence of the number of levels in the categorical variables. The coefficient estimates are listed in Table 23.

Table 23: Model Alpha Coefficients
term estimate std.error statistic p.value
(Intercept) 8.356 2.816 2.967 0.003
cast_scores 0.001 0.000 3.427 0.001
genreAnimation -0.743 0.856 -0.869 0.386
genreArt House & International -2.541 0.638 -3.981 0.000
genreComedy -1.329 0.346 -3.845 0.000
genreDocumentary -4.065 0.498 -8.162 0.000
genreDrama -2.028 0.296 -6.848 0.000
genreHorror -0.681 0.568 -1.200 0.231
genreMusical & Performing Arts -3.629 0.762 -4.763 0.000
genreMystery & Suspense -1.454 0.380 -3.829 0.000
genreOther -1.818 0.604 -3.008 0.003
genreScience Fiction & Fantasy -0.528 0.707 -0.746 0.456
critics_score 0.017 0.004 4.900 0.000
mpaa_ratingPG -0.108 0.609 -0.177 0.860
mpaa_ratingPG-13 0.751 0.625 1.202 0.230
mpaa_ratingR 0.363 0.602 0.602 0.547
mpaa_ratingUnrated -0.847 0.717 -1.182 0.238
cast_votes 0.000 0.000 2.698 0.007
best_pic_nomyes 1.052 0.462 2.279 0.023
runtime_log 0.765 0.427 1.793 0.074
best_dir_winyes 0.524 0.341 1.538 0.125

The intercept estimate, 8.356, is the regression estimate for the mean log number of IMDB votes for a G-rated action and adventure film with no best picture nomination, a director who has not won an Oscar, and zeros for all of the other variables. The other coefficient estimates adjust the estimate accordingly. Therefore, a prediction for the log number of IMDB votes is equal to:
* the intercept value, 8.356,
* plus 0.001 log IMDB votes for each composite score point earned by the cast members,
* plus a number of log IMDB votes associated with the genre of the film,
* plus 0.017 log IMDB votes for each point of the Rotten Tomatoes critics score,
* plus a number of log IMDB votes associated with the MPAA rating,
* plus approximately 0 log IMDB votes (0.000 to three decimal places) for each vote previously earned by the cast members,
* plus 1.052 log IMDB votes if the film was nominated for a best picture Oscar,
* plus 0.765 log IMDB votes for each log minute of runtime,
* plus 0.524 log IMDB votes if the film’s director has ever won an Oscar.
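
To make the additive structure concrete, the sketch below applies the rounded coefficients from Table 23 to a hypothetical film. Several things are assumptions: the logs are taken to be natural logs, the baseline levels are taken to be Action & Adventure and G, the film's attributes are invented, and predictions from three-decimal coefficients are only approximate (the cast_votes coefficient in particular rounds to zero).

```python
import math

INTERCEPT = 8.356
GENRE = {
    "Action & Adventure": 0.0,  # assumed baseline level
    "Animation": -0.743,
    "Art House & International": -2.541,
    "Comedy": -1.329,
    "Documentary": -4.065,
    "Drama": -2.028,
    "Horror": -0.681,
    "Musical & Performing Arts": -3.629,
    "Mystery & Suspense": -1.454,
    "Other": -1.818,
    "Science Fiction & Fantasy": -0.528,
}
MPAA = {"G": 0.0, "PG": -0.108, "PG-13": 0.751, "R": 0.363, "Unrated": -0.847}


def predict_log_votes(film):
    """Predicted log number of IMDB votes from the rounded Table 23 estimates."""
    return (
        INTERCEPT
        + 0.001 * film["cast_scores"]
        + GENRE[film["genre"]]
        + 0.017 * film["critics_score"]
        + MPAA[film["mpaa_rating"]]
        + 0.000 * film["cast_votes"]  # coefficient rounds to zero
        + 1.052 * film["best_pic_nom"]
        + 0.765 * math.log(film["runtime"])
        + 0.524 * film["best_dir_win"]
    )


# Hypothetical film, invented for illustration.
film = {
    "cast_scores": 500, "genre": "Drama", "critics_score": 80,
    "mpaa_rating": "R", "cast_votes": 1000, "best_pic_nom": 0,
    "runtime": 120, "best_dir_win": 0,
}
prediction = predict_log_votes(film)
print(round(prediction, 3))
```

Only one genre term and one MPAA term contribute to any given prediction, which is why the 8 variables expand into 20 slope coefficients.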

Model Summary

The purpose of this section was to develop a model that would be able to predict “box office success”. Given the significant right skew in box office revenue, the log of box office revenue became the proxy for box office success. Two regression models were therefore fit in this section. Model One, the simple linear regression model (F(1, 177) = 332.24, p < .001), showed that the log number of IMDB votes was the best predictor of the log of box office revenue. With log IMDB votes designated as the response variable, Model Two (F(21, 462) = 15.76, p < .001) was selected from among four multiregression models employing forward selection and backward elimination algorithms. Next, the models will be used to predict the log number of IMDB votes and the log box office for a randomly selected film.



John James jjames@datasciencesalon.org

21 November, 2017